ABSTRACT

Business Process Management Systems (BPMS) log events
and traces of activities during the execution of a process.
Anomalies are defined as deviations or departures from the
normal or common order. Anomaly detection in business
process logs has several applications, such as fraud detection
and understanding the causes of process errors. In this paper,
we present a novel approach for anomaly detection in
business process logs. We model the event logs as sequential
data and apply kernel-based anomaly detection techniques
to identify outliers and discordant observations. Our
technique is unsupervised (it does not require a pre-annotated
training dataset) and employs a kNN (k-nearest neighbor)
kernel-based technique with the normalized longest common
subsequence (LCS) similarity measure. We conduct experiments
on a recent, large, real-world incident management dataset
of an enterprise and demonstrate that our approach is
effective.
1. RESEARCH MOTIVATION AND AIM

Business Process Management Systems (BPMS), Workflow
Management Systems (WMS) and Process Aware Information
Systems (PAIS) log events and activities during
the execution of a process. Process Mining is a relatively
young and emerging discipline that analyzes the
event logs from such systems to extract knowledge, such
as discovering the runtime process model (discovery),
checking and verifying the design-time process model against
the runtime process model (conformance analysis) and
improving the business process (recommendation and
extension) [3]. A process consists of cases or traces. A case
consists of events. Each event in the event log relates to
precisely one case. Events within a case are ordered and
have attributes such as activity, timestamp and actor, along
with additional information such as cost. The traces and
activities in event logs can be modeled as sequential and
time-series data.

Anomaly detection in business process logs is an area that
has attracted the attention of several researchers. Anomalies
are patterns in data that do not conform to a well-defined
notion of normal behavior. Anomaly detection in business
process logs has several applications, such as fraud detection,
identification of malicious activity and system breakdowns,
and understanding the causes of process errors. We review
the papers most closely related to the work presented
in this paper. Rogge-Solti et al. propose a Bayesian
model that can be automatically inferred from the Petri-net
representation of a business process and is then used to
detect non-obvious and temporal anomalies. Bezerra et al.
propose and compare three algorithms for detecting anomalies
in logs of process-aware systems: the threshold, iterative and
sampling algorithms. They evaluate the performance of their
algorithms on a set of 1500 artificial logs and demonstrate
the effectiveness of their approach.

The focus of the study presented in this paper is on anomaly
detection in business process logs. We present a different and
fresh perspective on the stated problem, and our work is
motivated by the need to extend the state of the art in
anomaly detection techniques for business process event
logs. We model the event logs as sequential data and
apply kernel-based anomaly detection techniques (a
significant departure from previous approaches) to identify
outliers and discordant observations. The research aims
and contributions of the work presented in this paper are the
following.

1. To investigate kernel-based sequential data anomaly
detection techniques for detecting anomalies and
outliers in business process event logs. While there
has been work on anomaly detection in business process
logs, to the best of our knowledge, the work presented
in this paper is the first focused study on the
application of kernel-based sequential data anomaly
detection methods to the given problem.

2. To conduct an in-depth empirical analysis on a real-world
dataset and demonstrate the effectiveness of our proposed
approach. We conduct experiments on a recent,
large, real-world incident management dataset of an
enterprise. The analysis presented in this paper is the first
study on such a dataset for the application of anomaly
detection.
Figure 1: Proposed solution approach (called Nirikshan), consisting of a processing pipeline from raw data
transformation to anomaly detection


Table 1: Actor, Activity and Timestamp for one of
the Cases in the Dataset

DateStamp           Activity            Group
7/1/2013 8:17       Reassignment        01
4/11/2013 13:41     Reassignment        02
4/11/2013 13:41     Update from cust    02
4/11/2013 12:09     Operator Update     03
4/11/2013 12:09     Assignment          03
4/11/2013 13:41     Assignment          02
4/11/2013 13:51     Closed              03
4/11/2013 13:51     Caused By CI        03
4/11/2013 12:09     Reassignment        03
25/09/2013 08:27    Operator Update     03

2. EXPERIMENTAL DATASET 

We conduct our study on a large, real-world, publicly available
dataset so that our experiments can be replicated and
the results can be used for comparison or benchmarking
purposes. The work presented in this paper meets the required
replication standards, providing sufficient information for any
third party to replicate the results without additional
information from us. We conduct experiments on the publicly
available dataset provided by the tenth International
Workshop on Business Process Intelligence (BPI). Data
collection is one of the most important stages in conducting
qualitative research, and the quality of the results obtained
depends on both the research design and the data gathered. The
data provided on the BPI workshop website is of high quality, as
it is peer-reviewed and prepared by experts on the given topic.
As academics, we believe in and encourage the sharing of academic
code and software in the interest of improving openness and
research reproducibility. We release our code and dataset
in the public domain so that other researchers can validate our
scientific claims and use our tool for comparison or
benchmarking purposes (as well as for reuse and extension). Our
code is hosted on GitHub [not mentioned due to blind-review
policy], which is a popular web-based hosting service
for software development projects.

3. SOLUTION APPROACH 

Figure 1 shows the high-level architecture of the proposed
solution approach (called Nirikshan), which consists of three steps.

^ http://www.win.tue.nl/bpi/2014/challenge 


The three steps (data transformation, anomaly score distribution
computation and application of the kNN method) are
labeled A, B and C respectively. The Rabobank Group
ICT Incident Dataset consists of 46616 incidents or cases
and 466737 events. The fields in the event-log dataset are:
Incident ID, Timestamp, Incident Activity Number, Incident
Activity Type, Assignment Group and KM number (a
number related to a knowledge document). Table 1 shows the
Actor, Activity and Timestamp for one of the cases in the
dataset. The event-log data in Table 1 shows that several
activities are performed by various actors during the workflow
and process enactment. Table 1 also shows that the data has
a sequential aspect (which is in the nature of a
business process log), and hence we believe that techniques for
anomaly detection in sequences can be applied to the event-log
data. While the sequence in the given example is multivariate,
in this work we consider only the activity attribute
and model the sequence as univariate. Each case, consisting
of several events, is represented as a sequence of symbols
(refer to Phase A of the solution approach in Figure 1). Each
unique activity is mapped to a symbol. There are 39 different
kinds of activities in the dataset and hence 39
different symbols. Some examples of activities are:
Referred (REF), Problem Closure (PC), OO Response (OOR),
Dial-In (DI) and Contact Change (CC). The sequences are
of different lengths. There are a total of 46616 sequences in
the dataset.
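
The Phase A transformation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_sequences`, the `(case_id, timestamp, activity)` tuple layout and the integer symbol numbering (assigned in order of first appearance) are all assumptions made for the example.

```python
from collections import defaultdict

def build_sequences(events):
    """Turn an event log into one symbol sequence per case.

    `events` is an iterable of (case_id, timestamp, activity) tuples
    (a hypothetical layout); timestamps only need to be sortable.
    """
    symbol_of = {}              # activity name -> integer symbol
    by_case = defaultdict(list)
    for case_id, timestamp, activity in events:
        # Assign the next unused integer to each newly seen activity.
        symbol_of.setdefault(activity, len(symbol_of))
        by_case[case_id].append((timestamp, symbol_of[activity]))
    # Order the events of each case by timestamp; the sort is stable,
    # so events with equal timestamps keep their input order.
    sequences = {
        case_id: [sym for _, sym in sorted(rows, key=lambda r: r[0])]
        for case_id, rows in by_case.items()
    }
    return sequences, symbol_of

if __name__ == "__main__":
    log = [
        ("c1", 2, "Assignment"),
        ("c1", 1, "Open"),
        ("c2", 1, "Open"),
        ("c1", 3, "Closed"),
    ]
    seqs, symbols = build_sequences(log)
    print(symbols)
    print(seqs)
```

Only the activity attribute is kept, which matches the univariate modeling choice stated above; the actor and other attributes are ignored in this sketch.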

Problem Definition: In our application, there is no reference
or training database available that contains only normal
sequences. Hence, our task is to detect anomalous sequences
(mapped from cases) in an unlabeled database
of sequences. This is an unsupervised anomaly detection
problem. A formal representation of the problem is: given a
set of n sequences, $S = \{S_1, S_2, \ldots, S_n\}$, find all sequences in
S that are anomalous with respect to the rest of S.

Figure 2: Histogram and kernel density estimate
for the anomaly score variable (K value for KNN
= 5000). Panel title: KNN (K=5000) (n=47641).

In the unsupervised anomaly detection approach, the entire
dataset is treated as a training dataset and an anomaly
score is assigned to each sequence with respect to this training
dataset (based on the assumption that the training dataset
contains few anomalous sequences) (refer to Phase B of
the solution approach in Figure 1). We hypothesize that
kernel-based techniques (which define an appropriate similarity
kernel for the sequences) can be used to detect anomalies for
the given dataset and application domain. k-nearest neighbor
(kNN) is a well-known and widely used kernel-based
technique built on a point-based anomaly detection algorithm.
The main idea behind the kNN kernel-based technique is
to compute an anomaly score for every data point that is
equal to the inverse of its similarity to its k-th nearest
neighbor in the training dataset S (refer to Phase C of the solution
approach in Figure 1). Once the anomaly score of each data
point is computed, outliers can be detected by identifying
the points with high anomaly scores or data points which
are $\theta$ (a predefined threshold) standard deviations away from
the mean of the anomaly scores (assuming the scores
follow a Gaussian or normal distribution). Kernel-based
techniques require a similarity kernel. The length of the longest
common subsequence (LCS) has been widely used as a
similarity kernel for computing the distance (or extent of
similarity) between two given sequences. We apply normalized
LCS (nLCS) as a similarity measure between two sequences
$S_p$ and $S_q$ (which can be of unequal length). The formula
for nLCS is shown in Equation 1.

$$\mathrm{nLCS}(S_p, S_q) = \frac{|\mathrm{LCS}(S_p, S_q)|}{\sqrt{|S_p|\,|S_q|}} \qquad (1)$$
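
The kNN scoring with an nLCS kernel described above can be sketched as follows. This is a hedged illustration under two assumptions that the text does not spell out: that nLCS uses the commonly used normalization, LCS length divided by the square root of the product of the two sequence lengths, and that the score of a sequence is the inverse of its similarity to its k-th most similar other sequence. The function names are hypothetical.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def nlcs(a, b):
    """Normalized LCS similarity in [0, 1] (assumed normalization, see lead-in)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))

def knn_anomaly_scores(sequences, k):
    """Score each sequence as the inverse of its similarity to its k-th nearest neighbor."""
    if not 1 <= k < len(sequences):
        raise ValueError("k must be between 1 and len(sequences) - 1")
    scores = []
    for i, s in enumerate(sequences):
        # Similarities to every other sequence, most similar first.
        sims = sorted(
            (nlcs(s, t) for j, t in enumerate(sequences) if j != i),
            reverse=True,
        )
        kth = sims[k - 1]
        scores.append(math.inf if kth == 0 else 1.0 / kth)
    return scores

if __name__ == "__main__":
    data = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [9, 8]]
    print(knn_anomaly_scores(data, k=1))
```

This naive version compares every pair of sequences, so its cost grows quadratically with the number of sequences; for a dataset of the size used here, caching similarities between identical sequences would likely be needed in practice.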

4. EMPIRICAL ANALYSIS 

We apply the kNN-based anomaly detection technique with
two values of the experimental parameter: k = 5000 and k = 2500.
We first identify the statistical and density distribution of the
anomaly scores and check whether it is Gaussian (normal).
Figure 2 shows the histogram, which divides
the horizontal axis into sub-intervals or bins covering
the range of the data from a minimum of 1.49 to a maximum
of 5.19. The data sample for the histogram and
density distribution is the entire population. The solid blue
curve is the kernel density estimate, which is a generalization
of the histogram. We use kernel density estimation
to estimate the probability density function of the anomaly
score variable. In Figure 2, the data points are represented
by small circles on the x-axis. We observe that the data has
a Gaussian distribution. The smoothing parameter (bandwidth)
for the kernel density estimate in Figure 2 is 0.15.
The mean ($\mu$), variance ($\sigma^2$) and standard deviation ($\sigma$) of
the data are 1.998, 0.093 and 0.306 respectively. We identify
anomalies (also called outliers or discordant observations)
by using a standard distance metric to determine how far
away each point is from the normal data. The anomalies
are marked in Figure 2. The top 5 anomaly scores are:
5.196, 4.516, 4.467, 3.968 and 3.939. We check how many
standard deviations each data point lies from the mean (also
called the expected value) of the data. To do so, we compute
the z-score of each point, $z = (x - \mu)/\sigma$, which is a
measure of how many standard deviations
a data point is away from the mean of the data. Any data
point (our interest is in points on the right side of the mean
in the given context) that has a z-score higher than 5 is an
outlier and likely to be an anomaly. The points become
more obviously anomalous as the z-score increases above 5.
We found 21 points with a z-score of more than 5. The top
5 z-scores are: 10.451, 8.230, 8.070, 6.439 and 6.345.

^ http://trevorwhitney.com/data_mining/anomaly_detection

Figure 3: Histogram and kernel density estimate
for the anomaly score variable (K value for KNN
= 2500)

Figure 4: Fragment of the discovered process map
using DISCO process mining tool at a resolution
showing only core transitions
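
The z-score thresholding used in this section can be illustrated with a short sketch. The function names are hypothetical, and the population standard deviation is used on the assumption that, as stated above, the sample is the entire population. The worked check at the end reuses the reported mean (1.998), standard deviation (0.306) and top anomaly score (5.196).

```python
import statistics

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

def flag_outliers(scores, threshold=5.0):
    """Indices of scores whose z-score exceeds the threshold (right tail only)."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)  # population standard deviation
    if sigma == 0:
        return []  # all scores identical, nothing stands out
    return [i for i, x in enumerate(scores) if z_score(x, mu, sigma) > threshold]

if __name__ == "__main__":
    # Worked check against the values reported for k = 5000.
    print(round(z_score(5.196, 1.998, 0.306), 3))
    # Toy data: 99 ordinary scores and one extreme score.
    print(flag_outliers([1.0] * 99 + [10.0]))
```

With the reported values, (5.196 - 1.998) / 0.306 is approximately 10.451, which agrees with the largest z-score listed above.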

Similarly, we apply the same procedure with k =
2500. Figure 3 (k = 2500) shows the histogram and
kernel density estimate covering the range of the data from
a minimum of 1.29 to a maximum of 4.58. We observe that
the data has a Gaussian distribution. The mean ($\mu$),
variance ($\sigma^2$) and standard deviation ($\sigma$) of the data are 1.7692,
0.0797 and 0.2823 respectively. The top 5 anomaly scores
are: 4.582, 4.127, 4.106, 3.464 and 3.325. The top 5 z-scores
are: 9.966, 8.354, 8.278, 6.004 and 5.513.
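
The kernel density estimates shown alongside the histograms can be approximated with a sketch like the one below. The text reports a bandwidth of 0.15 for the k = 5000 figure but does not name the kernel, so the Gaussian kernel used here is an assumption, as is the function name `gaussian_kde`.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with a Gaussian kernel.

    The Gaussian kernel is an assumption; the bandwidth plays the role of
    the smoothing parameter mentioned in the text.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

if __name__ == "__main__":
    # Toy scores; 0.15 is the bandwidth reported for the k = 5000 figure.
    pdf = gaussian_kde([1.5, 2.0, 2.5], bandwidth=0.15)
    for x in (1.5, 1.75, 2.0):
        print(x, round(pdf(x), 4))
```

A larger bandwidth gives a smoother curve and a smaller one follows the histogram more closely, which is why the bandwidth is worth reporting next to any judgment about the shape of the distribution.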

We use DISCO to discover the run-time process model
from the given dataset and verify whether the anomalous cases
identified by our technique match the ones extracted
by the DISCO tool. Figure 4 shows a fragment of the
process model extracted from DISCO (due to limited space
it is not possible to display the entire process model). The
discovered process model consists of nodes representing the
activities in the dataset and directed edges representing the
transitions between nodes. The color of a node (and the edge
thickness) is proportional to the frequency of the activity
(darker for higher frequency). As shown in Figure 4, activity
label 0 is dark in color, as it has a large number of incoming
transitions. There are 39 different activities indexed from
0 to 38. The index used is: Caused By CI [0], Reopen [1],
Prob. Work. [2], External Vendor Assig. [3], Op. Update
[4], Urgency Change [5], Comm. customer [6], Impact
Change [7], Quality Ind. Fixed [8], Update [9], Anal./Res.
[10], Desc. Update [11], External update [12], Pend vendor
[13], Prob. Closure [14], Callback Request [15], Update
customer [16], Notify By Change [17], Open [18], Dial-in
[19], Status Change [20], Affected CI Change [21], Mail Cust
[22], Referred [23], Contact Change [24], Reassignment [25],
Comm. vendor [26], Closed [27], Resolved [28], Quality
Indicator Set [29], Vendor Ref Change [30], Quality Indicator
[31], Vendor Ref [32], Incident repr [33], External Vendor
Reass [34], Assig [35], OO Response [36], alert stage 1 [37],
Service Change [38]. Figure 5 shows the distribution of the
number of events per case. Figure 6 shows the distribution
of the case variants. Both of the distributions displayed
in Figures 5 and 6 are skewed, as one of the tails is longer
than the other. Both distributions have a positive skew,
as the long tail is in the positive direction. The distribution
in Figure 5 reveals that most of the cases have few events,
whereas a small number of cases consist of a large number of
events. Similarly, the distribution in Figure 6 indicates that
the mean and median of the case variants are greater than
the mode and that the dataset contains a long tail of case
variants that each occur a small number of times. Table 2
shows 5 of the top 10 anomalies
extracted by our approach. We validate them against the
infrequent case variants reported by DISCO.

^ http://fluxicon.com/disco/

Figure 5: Histogram for events per case

Figure 6: Histogram for case variants

Table 2: Anomalies extracted from the proposed solution approach (kNN kernel based technique)

No.  Activity sequence
1    35; 35; 35; 35; 9; 25; 25; 25; 1; 27; 1; 4; 4; 9; 4; 9; 9; 9; 9; 9; 25; 25; 25; 11; 25; 25; 25; 4; 20; 9; 4; 35; 35; 35; 35; 35; 35; 25; 35; 35; 35; 27;
2    27; 6; 27; 9; 4; 4; 16; 6; 27; 16; 27; 27; 16; 6; 27; 27; 27; 6; 4; 27; 27; 4; 4; 17; 27; 27; 27; 27; 4; 4; 4; 16; 27; 6; 27; 16; 18; 6; 25; 9;
     35; 20; 35; 6; 20; 27; 27; 27; 27; 27; 27; 27; 27; 27; 4; 11; 27; 27; 27; 27; 27; 24; 6; 16; 4; 16; 4; 4; 27; 27; 27; 16; 27; 27; 27; 27; 16;
     16; 2; 16; 6; 16; 27; 35; 25; 27; 27; 4; 35; 4; 6; 35; 27; 27; 4; 6; 27; 27; 27; 35; 6; 27; 27; 20; 35; 16; 35; 4; 4; 16; 27; 27; 16; 27; 27;
     16; 27; 27; 4; 16; 4; 27; 5; 4; 27; 35; 6; 4; 25; 16; 16; 16; 27; 35; 4; 27; 16; 27; 27; 4; 16; 16; 27; 6; 6; 27; 16; 27; 6; 27; 27; 27; 27; 4;
     9; 27; 16; 27; 6; 6; 6; 9; 4; 17; 27; 27; 27; 27; 6; 27;
3    20; 27; 0; 35; 32; 25; 18; 3; 20
4    25; 25; 11; 18; 27; 0;
5    0; 27;
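
The events-per-case and case-variant distributions discussed in this section, and the cross-check against infrequent variants, can be approximated outside a process mining tool. The sketch below is not DISCO's API; it is a plain illustration that assumes a case variant means a distinct activity sequence, and all function names and the `max_count` cutoff are hypothetical.

```python
from collections import Counter

def events_per_case(sequences):
    """Histogram data: sequence length -> number of cases with that length."""
    return Counter(len(seq) for seq in sequences.values())

def variant_frequencies(sequences):
    """Number of cases that follow each distinct activity sequence (variant)."""
    return Counter(tuple(seq) for seq in sequences.values())

def infrequent_variant_cases(sequences, max_count=1):
    """Case ids whose variant is followed by at most max_count cases."""
    freq = variant_frequencies(sequences)
    return {
        case_id
        for case_id, seq in sequences.items()
        if freq[tuple(seq)] <= max_count
    }

if __name__ == "__main__":
    # Toy cases using the activity index from the text (18 = Open, 35 = Assig, 27 = Closed).
    cases = {"a": [18, 35, 27], "b": [18, 35, 27], "c": [0, 27]}
    print(events_per_case(cases))
    print(variant_frequencies(cases))
    print(infrequent_variant_cases(cases))
```

Intersecting the set returned by `infrequent_variant_cases` with the cases flagged by the anomaly scores gives one simple way to quantify the agreement that this section describes qualitatively.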


5. CONCLUSIONS 

We present a technique to detect anomalies in business
process event logs. We apply a kNN kernel-based sequential
anomaly detection method and conduct experiments
on a real-world dataset. We validate the effectiveness of the
proposed approach and conclude that kernel-based sequential
data anomaly detection techniques can be effectively
applied to extracting outliers from business
process event logs. We learn that the similarity kernel used (for
example, nLCS) in the proposed technique and the value of
the kNN parameter (k) have an effect on the outcome.